Introduction

Sale Price Prediction Model

The main goal of this study is to develop a model to predict the sale price of a house. A well-performing model should provide an acceptable level of predictive accuracy while using only a limited number of features, with the aim of reducing complexity and overfitting. A model of reduced size seeks to achieve a useful balance between the prediction and explanation goals.

Effect of Neighborhood on Living Area Price

For this study, we will also perform some exploratory data analysis to explain how the effect of a continuous predictor on the response varies across different treatment groups of a categorical predictor. Namely, we’ll test:

\(H_0: \beta_{j,LivArea:Neighborhood} = 0, \forall j \text{ - The mean price per square foot of a house is the same for every neighborhood j }\)

\(H_1: \beta_{j,LivArea:Neighborhood} \neq 0 \text{ for at least one } j \text{ - At least one neighborhood j has a different mean price per square foot than the others}\),

using a significance level of \(\alpha = 0.01\).
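
Concretely, these hypotheses refer to the interaction coefficients of the model

\(SalePrice = \beta_0 + \beta_{LivArea} \cdot LivArea + \sum_{j} \gamma_j v_j + \sum_{j} \beta_{j,LivArea:Neighborhood} \cdot LivArea \cdot v_j + \epsilon\),

where \(v_j\) is the dummy variable for neighborhood \(j\), so that the slope of sale price on living area in neighborhood \(j\) is \(\beta_{LivArea} + \beta_{j,LivArea:Neighborhood}\).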

We will also attempt to infer which neighborhoods place the highest monetary values on living area and determine whether the living area predictor is collinear with other predictors.

Housing Dataset

To develop the model, we will use a dataset describing houses sold between 2006 and 2010 in Ames, Iowa. This dataset was compiled by Dean De Cock in 2011, and it provides 2930 observations with 81 features (23 nominal/categorical, 23 ordinal, 14 discrete, and 21 continuous).

The features can be roughly divided into 6 categories.

  • House characteristics: e.g. number of bedrooms, number of bathrooms, electrical system
  • Space measurements: e.g. lot area, living area
  • Quality ratings: e.g. overall condition, kitchen condition
  • Construction characteristics and materials: e.g. roof material
  • Zone and location classification: e.g. zoning classification, neighborhood
  • Terms of sale: e.g. month and year sold, sale price

Methods

Data Cleaning

We will begin by loading the housing data stored in AmesHousing.csv.

library(readr)
housing_data = read_csv("data/AmesHousing.csv")

There are two variables that do not describe house qualities. PID is the unique identifier of each observation and Order is a counter for each observation.

# Remove the identifier variables
housing_data = subset(housing_data, select = -c(Order, PID))

Next, we’ll inspect the structure of the dataset.

dim(housing_data) ## Dimension of the dataset.
## [1] 2930   80
table(sapply(housing_data, class)) ## Variable types
## 
## character   integer 
##        44        36

We can observe that the dataset is composed of 2930 observations and 80 variables, all of type character or integer. A description of each variable can be found at: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.

Null Values

We’ll examine whether there are NA values in the dataset and decide how to manage them.

hasNA = colSums(is.na(housing_data))[colSums(is.na(housing_data)) > 0]
hasNA
##   Lot Frontage          Alley   Mas Vnr Type   Mas Vnr Area      Bsmt Qual 
##            490           2732             23             23             80 
##      Bsmt Cond  Bsmt Exposure BsmtFin Type 1   BsmtFin SF 1 BsmtFin Type 2 
##             80             83             80              1             81 
##   BsmtFin SF 2    Bsmt Unf SF  Total Bsmt SF     Electrical Bsmt Full Bath 
##              1              1              1              1              2 
## Bsmt Half Bath   Fireplace Qu    Garage Type  Garage Yr Blt  Garage Finish 
##              2           1422            157            159            159 
##    Garage Cars    Garage Area    Garage Qual    Garage Cond        Pool QC 
##              1              1            159            159           2917 
##          Fence   Misc Feature 
##           2358           2824

27 variables contain NA values. Some of those variables have a high number of NAs, so we decided to remove any variable with more than 400 NAs:

# Keep only those variables that have fewer than 400 NAs
housing_data = subset(housing_data, select = colSums(is.na(housing_data)) < 400)

After filtering, we now have 2930 observations and 74 variables.

In order to avoid any issues with the remaining observations that have NAs we are removing those as well:

housing_data = na.omit(housing_data)

After filtering, we now have 2678 observations and 74 variables.

Unusual Observations


There are 5 unusual observations that were pointed out by the author of the dataset. These are easily observable by plotting SalePrice (price of the house) against Gr Liv Area (above-ground living area in square feet) as follows:

plot(housing_data$SalePrice ~ housing_data$`Gr Liv Area`,
     col = ifelse(housing_data$`Gr Liv Area` > 4000, "orange", "dodgerblue"),
     xlab = "Ground Living Area (square feet)",
     ylab = "Sale Price (US Dollars)",
     main = "Sale Price vs Ground Living Area",
     pch = 20,
     cex = ifelse(housing_data$`Gr Liv Area` > 4000, 2, 0.7)
     )

We observe 3 points that have an extremely low price for a huge ground living area. The other two points seem to be priced consistently with the rest of the data but are still extreme outliers in size. Following the dataset author’s recommendation, we remove all observations with a living area above 4000 square feet.

## Remove observations with living area larger than 4000 sq ft.
housing_data = housing_data[housing_data$`Gr Liv Area` < 4000, ]

We also note that two neighborhoods have fewer than 3 observations. A sample size smaller than 3 for a categorical group is not sufficient to derive meaningful estimates.

table(housing_data$Neighborhood)
## 
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      28      10      29      93      42     262     101     139     159 
##  Greens GrnHill  IDOTRR Landmrk MeadowV Mitchel   NAmes NoRidge NPkVill 
##       8       1      69       1      25     101     416      69      23 
## NridgHt  NWAmes OldTown  Sawyer SawyerW Somerst StoneBr   SWISU  Timber 
##     163     131     203     140     107     170      51      38      70 
## Veenker 
##      24
## Remove observations belonging to neighborhoods with fewer than 3 observations
housing_data = housing_data[-which(housing_data$Neighborhood %in% c("GrnHill","Landmrk")),]

Lastly, it is necessary to coerce all character columns to factors in order to fit the categorical predictors of the model:

# Determine variables that are of type character
char_var = lapply(housing_data, class) == "character"
# Coerce those columns to factor
housing_data[, char_var] = lapply(housing_data[, char_var], as.factor)
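
As a quick sanity check (a sketch; the resulting counts are omitted here), we can re-run the earlier type summary and confirm that no character columns remain:

# All former character columns should now be factors
table(sapply(housing_data, class))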

Baseline analysis

As stated in the Introduction, the goal of the project is to obtain a model for predicting sale price while keeping it simple; in other words, to find a good balance between prediction and explanation.

Before starting model selection, let’s divide the data into a training and a test set. That will help us evaluate the final chosen model. We’ll use the seed 420 for any random process so the results can be reproduced if needed.

set.seed(420)

# Get training indexes randomly (80% of the observations)
train_index <- sample(seq_len(nrow(housing_data)), size = floor(0.8 * nrow(housing_data)))

# Divide the data
train_hd <- housing_data[train_index, ]
test_hd <- housing_data[-train_index, ]

After splitting the data we now have 2136 observations for training and 535 observations for test.

In order to assess our models we are going to use the set of auxiliary functions defined in the week 9 assignment. We also define a macro function that calls all those auxiliary functions and shows the results in a single call. This overview comprises the p-values of the Shapiro-Wilk (normality) and Breusch-Pagan (constant variance) tests, the Fitted vs Residuals plot, the Normal Q-Q plot, the number of parameters used (betas), the LOOCV RMSE, and the adjusted \(R^2\).
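
For reference, the LOOCV RMSE is computed with the standard leverage shortcut, which avoids refitting the model \(n\) times:

\(\text{RMSE}_{LOOCV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2}\),

where \(e_i\) are the residuals and \(h_i\) the hat values (leverages) of the fitted model; this is exactly what get_loocv_rmse below implements.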

# Library required for Breusch-Pagan test
library(lmtest)

# Plot Fitted vs Residuals
plot_fitted_resid = function(model) {
  plot(fitted(model), resid(model), 
       col = "dodgerblue", pch = 20, cex = .7,
       xlab = "Fitted", ylab = "Residuals",
       main = "Fitted vs residuals plot")
  abline(h = 0, col = "darkorange", lwd = 1)
}

# Plot Normal Q-Q
plot_qq = function(model) {
  qqnorm(resid(model), col = "dodgerblue", pch = 20, cex = .7)
  qqline(resid(model), col = "darkorange", lwd = 1)
}

# Breusch-Pagan test (constant variance)
get_bp_decision = function(model) {
  bptest(model)$p.value
}

# Shapiro-Wilk test (normality)
get_sw_decision = function(model) {
  shapiro.test(resid(model))$p.value
}

# Number of parameters
get_num_params = function(model) {
  length(coef(model))
}

# LOOCV RMSE
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# Adjusted R^2
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}

# Function that combines the previously defined functions
get_overview = function(model) {
  par(mfrow=c(1,2)) #Plot two graphs in the same image
  plot_fitted_resid(model)
  plot_qq(model)
  loocv_rmse = get_loocv_rmse(model)
  adj_r2 = get_adj_r2(model)
  bp_p_value = get_bp_decision(model)
  shapiro_p_value = get_sw_decision(model)
  betas = get_num_params(model)
  list(loocv_rmse = loocv_rmse, adj_r2 = adj_r2, bp_p_value = bp_p_value, shapiro_p_value = shapiro_p_value, betas = betas)
}

We can start the process of finding our model by using an additive model as a baseline:

model_add = lm(SalePrice ~ ., data = train_hd)

Let’s run the overview function to assess this baseline model:

get_overview(model_add)

## $loocv_rmse
## [1] Inf
## 
## $adj_r2
## [1] 0.9334242
## 
## $bp_p_value
##           BP 
## 1.341791e-22 
## 
## $shapiro_p_value
## [1] 1.678226e-32
## 
## $betas
## [1] 249

From the results above, note that there are problems with every metric except the adjusted \(R^2\). The LOOCV RMSE is Inf (because some hat values equal 1), both assumptions are violated (the p-values are extremely low), and the number of parameters is huge (because of the categorical variables, we end up with far more betas than available predictors). The Fitted vs Residuals plot also suggests that both the zero-mean and the constant-variance conditions fail across fitted values (no linearity and no constant variance).

Based on the Normal Q-Q plot, it is evident that there is a huge deviation at the tails. We can apply a log transformation to the response and check whether this reduces the effect:

model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)

## $loocv_rmse
## [1] Inf
## 
## $adj_r2
## [1] 0.9362686
## 
## $bp_p_value
##           BP 
## 4.827006e-17 
## 
## $shapiro_p_value
## [1] 1.375178e-35
## 
## $betas
## [1] 249

The Shapiro-Wilk p-value is worse, but the Normal Q-Q plot looks better. This may be due to some outliers that can be observed on the plot.

The outliers can be identified using the plot function:

par(mfrow=c(2,2))
plot(model_add_log)

From the resulting plots we can see that observations 1682, 578, and 213 seem to be outliers. Let’s remove them:

train_hd = train_hd[-c(1682, 578, 213),]
model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)

## $loocv_rmse
## [1] Inf
## 
## $adj_r2
## [1] 0.9449537
## 
## $bp_p_value
##           BP 
## 7.455871e-24 
## 
## $shapiro_p_value
## [1] 7.691894e-24
## 
## $betas
## [1] 249

It now looks a bit better, but there is still a noticeable violation of normality at the tails. The Fitted vs Residuals plot also looks better, especially regarding the zero mean at any fitted value.

We tried to count outliers and influential points using the heuristics from the book, but both sums evaluate to NA: observations with a hat value of 1 have undefined standardized residuals and Cook’s distances.

# Number of influential points in the transformed model
sum(cooks.distance(model_add_log) > 4 / length(cooks.distance(model_add_log)))
## [1] NA
# Number of outliers in the transformed model
sum(abs(rstandard(model_add_log)) > 2)
## [1] NA
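
If we want counts that simply skip those undefined cases, we can drop the NAs inside the sums. This is a minimal sketch; the resulting counts are not shown:

# Same heuristics, ignoring observations whose hat value of 1 makes the
# standardized residual and Cook's distance undefined
cd = cooks.distance(model_add_log)
sum(cd > 4 / length(cd), na.rm = TRUE)
sum(abs(rstandard(model_add_log)) > 2, na.rm = TRUE)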

We can also see the positive effect of the log transformation on the response by looking at its histogram. Without the log:

hist(train_hd$SalePrice, 
     breaks = 30,
     main = "Histogram of Sale Price without transformation",
     xlab = "Sale Price",
     col = "gray")

As you can see, the distribution is right-skewed. This can be normalized by applying the log:

hist(log(train_hd$SalePrice), 
     breaks = 30,
     main = "Histogram of Sale Price with log transformation",
     xlab = "Sale Price",
     col = "gray")

The transformed distribution is much closer to symmetric, so the log transformation makes sense for the response.

Let’s use this transformed additive model as our baseline. There are still many diagnostics that can be improved.

Selection methods

There are many predictors to choose from. We can try to select a smaller base of “relevant” numerical predictors and then assess the addition of categorical variables through a forward procedure.

One way of doing this is by selecting those numerical predictors that are highly correlated with the response.

# Get correlation list of all numerical variables against Sale Price
sale_price_cor = cor(train_hd[sapply(train_hd, is.numeric)])[,"SalePrice"]
sale_price_cor
##        Lot Area    Overall Qual    Overall Cond      Year Built 
##      0.27204535      0.79426380     -0.14247853      0.54446953 
##  Year Remod/Add    Mas Vnr Area    BsmtFin SF 1    BsmtFin SF 2 
##      0.53277472      0.49309078      0.42169310     -0.02516887 
##     Bsmt Unf SF   Total Bsmt SF      1st Flr SF      2nd Flr SF 
##      0.16654274      0.65473647      0.65230975      0.25015743 
## Low Qual Fin SF     Gr Liv Area  Bsmt Full Bath  Bsmt Half Bath 
##     -0.02144407      0.73065152      0.27390192     -0.07082941 
##       Full Bath       Half Bath   Bedroom AbvGr   Kitchen AbvGr 
##      0.55217520      0.25680213      0.14516273     -0.08280057 
##   TotRms AbvGrd      Fireplaces   Garage Yr Blt     Garage Cars 
##      0.52189041      0.45534222      0.52975789      0.65898016 
##     Garage Area    Wood Deck SF   Open Porch SF  Enclosed Porch 
##      0.65081976      0.30463025      0.32587883     -0.11539154 
##      3Ssn Porch    Screen Porch       Pool Area        Misc Val 
##      0.02536506      0.10758806      0.04195898     -0.01490168 
##         Mo Sold         Yr Sold       SalePrice 
##      0.05004228     -0.02129590      1.00000000

We can then drop the predictors whose correlation is low in magnitude. We’ll choose a threshold of 0.4:

# Filter correlations with magnitude smaller than a specified threshold
sale_price_cor = sale_price_cor[abs(sale_price_cor) > 0.4]
sale_price_cor
##   Overall Qual     Year Built Year Remod/Add   Mas Vnr Area   BsmtFin SF 1 
##      0.7942638      0.5444695      0.5327747      0.4930908      0.4216931 
##  Total Bsmt SF     1st Flr SF    Gr Liv Area      Full Bath  TotRms AbvGrd 
##      0.6547365      0.6523098      0.7306515      0.5521752      0.5218904 
##     Fireplaces  Garage Yr Blt    Garage Cars    Garage Area      SalePrice 
##      0.4553422      0.5297579      0.6589802      0.6508198      1.0000000

We now have a smaller group of numerical predictors to choose from. Reading the documentation for this set of variables shows that some of them are derived from others or are closely related, which leads to collinearity issues.

Year Built and Year Remod/Add are redundant: if there was no remodeling, the construction year is used for Year Remod/Add. Let’s keep Year Remod/Add only.

BsmtFin SF 1 and Total Bsmt SF are also related; let’s keep Total Bsmt SF, which has a higher correlation with SalePrice. The same happens with 1st Flr SF and Gr Liv Area; we’ll keep Gr Liv Area.

There are also 3 numerical variables related to the garage; Garage Cars seems to be the most relevant.
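
We can verify these suspected redundancies directly by computing the correlation within each pair (a quick sketch; the exact values depend on the training split):

# Pairwise correlations within the suspected-redundant pairs
with(train_hd, cor(`Year Built`, `Year Remod/Add`))
with(train_hd, cor(`BsmtFin SF 1`, `Total Bsmt SF`))
with(train_hd, cor(`1st Flr SF`, `Gr Liv Area`))
with(train_hd, cor(`Garage Cars`, `Garage Area`))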

We can now simplify our sale_price_cor vector with the chosen variables:

# Auxiliary vector with names
var = names(sale_price_cor)

# Drop the redundant variables identified above
drop_vars = c("Year Built", "BsmtFin SF 1", "1st Flr SF", "Garage Yr Blt", "Garage Area")
sale_price_cor = sale_price_cor[!(var %in% drop_vars)]

sale_price_cor
##   Overall Qual Year Remod/Add   Mas Vnr Area  Total Bsmt SF    Gr Liv Area 
##      0.7942638      0.5327747      0.4930908      0.6547365      0.7306515 
##      Full Bath  TotRms AbvGrd     Fireplaces    Garage Cars      SalePrice 
##      0.5521752      0.5218904      0.4553422      0.6589802      1.0000000

Let’s fit a model with the chosen numerical variables and complete the collinearity study with a VIF analysis. Don’t forget that we are transforming the response using log.
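
As a reminder, the variance inflation factor of predictor \(j\) is

\(VIF_j = \frac{1}{1 - R_j^2}\),

where \(R_j^2\) is the \(R^2\) of the regression of predictor \(j\) on all the other predictors, so a large VIF indicates that the predictor is mostly explained by the others.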

# Fit a simple additive model with the chosen numerical variables
model_add_num = lm(log(SalePrice) ~ ., data = train_hd[, names(sale_price_cor)])

# Run vif test
library(faraway)
vif(model_add_num)
##   `Overall Qual` `Year Remod/Add`   `Mas Vnr Area`  `Total Bsmt SF` 
##         2.472044         1.680024         1.308028         1.517613 
##    `Gr Liv Area`      `Full Bath`  `TotRms AbvGrd`       Fireplaces 
##         4.487689         2.098590         2.970419         1.339686 
##    `Garage Cars` 
##         1.930246

There are no values over 5, the heuristic threshold used in the course book to flag concerning collinearity. It seems that we removed the right numerical variables.

We can apply the get_overview function to evaluate assumptions as well:

get_overview(model_add_num)

## $loocv_rmse
## [1] 0.1416273
## 
## $adj_r2
## [1] 0.8586069
## 
## $bp_p_value
##           BP 
## 4.968603e-14 
## 
## $shapiro_p_value
## [1] 6.35861e-23
## 
## $betas
## [1] 10

Everything looks similar to the additive model using all the predictors. This time the LOOCV RMSE is not Inf, which is also a good sign (no hat values equal to 1).

Let’s now check whether transforming any of these predictors makes sense. A visual inspection of the pairs plot might suggest possible transformations.

pairs(train_hd[, names(sale_price_cor)])

Based on this output, it seems that a log transformation of Gr Liv Area and Total Bsmt SF could make sense to soften the effect of extreme values:

model_add_num_log = lm(log(SalePrice) ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath`  + `TotRms AbvGrd` + `Fireplaces` + `Garage Cars`, data = train_hd)

get_overview(model_add_num_log)

## $loocv_rmse
## [1] 0.1442976
## 
## $adj_r2
## [1] 0.8532499
## 
## $bp_p_value
##           BP 
## 2.673067e-11 
## 
## $shapiro_p_value
## [1] 8.29974e-21
## 
## $betas
## [1] 10

After the transformation, the assumption-related p-values improved and the rest of the metrics remained almost the same. We can keep these changes.

Let’s now try to simplify the model using a backward selection procedure and AIC.
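
For a linear model, step() compares candidates using

\(\text{AIC} = n \log\left(\frac{RSS}{n}\right) + 2p\),

where \(p\) is the number of estimated parameters. Replacing the penalty \(2p\) with \(\log(n)\,p\) yields BIC, which we will use later when adding categorical variables.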

model_add_num_log_back = step(model_add_num_log, direction = "backward", trace = 0)
model_add_num_log_back$coefficients
##          (Intercept)       `Overall Qual`     `Year Remod/Add` 
##         0.2040155504         0.0924130820         0.0036311220 
##       `Mas Vnr Area` log(`Total Bsmt SF`)   log(`Gr Liv Area`) 
##         0.0001084771         0.1973170774         0.3516903192 
##          `Full Bath`           Fireplaces        `Garage Cars` 
##        -0.0135658961         0.0640742108         0.0645105416
get_overview(model_add_num_log_back)

## $loocv_rmse
## [1] 0.1442363
## 
## $adj_r2
## [1] 0.8532733
## 
## $bp_p_value
##           BP 
## 6.120295e-12 
## 
## $shapiro_p_value
## [1] 9.024409e-21
## 
## $betas
## [1] 9

The number of rooms above ground (TotRms AbvGrd) was removed from the model. Let’s confirm that this removal is not significant using an ANOVA F-test:

p_val = anova(model_add_num_log_back, model_add_num_log)$"Pr(>F)"[2]
p_val
## [1] 0.4162366

We fail to reject the null hypothesis at any reasonable significance level, which means the simplified model can be used: the number of rooms above ground was not a significant addition. We now have only 8 numerical predictors.

Let’s see whether this model can be improved by adding categorical variables. This can be done with a forward selection procedure using BIC (setting k = log(n) in step() replaces the default AIC penalty with the BIC one) to obtain the smallest possible model:

# The scope includes all the filtered numerical variables plus all the available categorical variables
model_add_mix = step(model_add_num_log_back, direction = "forward", scope = SalePrice ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath` + `Fireplaces` + `Garage Cars` + Street + `MS SubClass` + `MS Zoning` + `Lot Shape` + `Land Contour` + Utilities + `Lot Config` + `Land Slope` + Neighborhood + `Condition 1` + `Condition 2` + `Bldg Type` + `House Style` + `Roof Style` + `Roof Matl` + `Exterior 1st` + `Exterior 2nd` + `Mas Vnr Type` + `Exter Qual` + `Exter Cond` + Foundation + `Bsmt Qual` + `Bsmt Cond` + `Bsmt Exposure` + `BsmtFin Type 1` + `BsmtFin Type 2` + Heating + `Heating QC` + `Central Air` + Electrical + `Kitchen Qual` + Functional + `Garage Type` + `Garage Finish` + `Garage Qual` + `Garage Cond` + `Paved Drive` + `Sale Type` + `Sale Condition`, k = log(length(resid(model_add_num_log_back))), trace = 0)

We can now run our diagnostics for the obtained model:

get_overview(model_add_mix)

## $loocv_rmse
## [1] Inf
## 
## $adj_r2
## [1] 0.9185498
## 
## $bp_p_value
##           BP 
## 2.429563e-38 
## 
## $shapiro_p_value
## [1] 1.101009e-22
## 
## $betas
## [1] 89

The assumption plots remain the same. There is a noticeable improvement in adjusted \(R^2\), but our LOOCV RMSE is Inf again because some hat values equal 1. Let’s explore how many we have:

sum(hatvalues(model_add_mix) == 1)
## [1] 4

Since the number of conflicting hat values is so low, we can simply exclude them and calculate the LOOCV RMSE manually:

# Exclude observations whose hat value equals 1, then apply the LOOCV shortcut
ok = hatvalues(model_add_mix) != 1
loocv_rmse_final_mod = sqrt(mean((resid(model_add_mix)[ok] / (1 - hatvalues(model_add_mix)[ok])) ^ 2))
loocv_rmse_final_mod
## [1] 0.1114185

This is a much better value than our previous model’s, which is very promising for prediction purposes. model_add_mix is our final choice. We’ll assess it against the test set in the Results section, where we’ll also discuss the chosen model in general.

Results

Predictive model chosen

The model chosen to predict sale price for the given dataset was stored in model_add_mix. It uses 20 predictors out of the 80 originally available (12 categorical and 8 numerical), which is aligned with our initial commitment to obtaining not just a good predictive model, but one balanced between prediction and explanation.
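
To see exactly which predictors the forward search retained (a quick sketch; the printed list is omitted here), we can inspect the terms of the fitted model:

# Predictors retained in the final model
attr(terms(model_add_mix), "term.labels")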

To assess the final model, we can first focus on its predictive capabilities and then on its explanatory capabilities.

From a prediction perspective, we can first plot the fitted values against the actual values to get a visual sense of how the chosen model performs on the training set:

plot(train_hd$SalePrice, exp(fitted(model_add_mix)),
     xlab = "Actual price (dollars)",
     ylab = "Predicted price (dollars)",
     main = "Predicted vs actual price in dollars for training set",
     col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")

The variance appears to increase for higher prices; below 400,000 dollars the model tracks the actual values very closely.

Let’s do the same for the test set. We’ll store the predicted values in a variable so they can be reused in further calculations.

# Remove the single test-set observation whose Kitchen Qual level ("Po") never
# appears in the training set; the model cannot predict for an unseen factor level
test_hd = test_hd[-which(test_hd$`Kitchen Qual` == "Po"),]

# Predict values
predicted = exp(predict(model_add_mix, test_hd))

We can now produce the same plot for the test set. Since these observations were not used for training, the results are unbiased and tell us a lot about the chosen model:

plot(test_hd$SalePrice, predicted,
     xlab = "Actual price (dollars)",
     ylab = "Predicted price (dollars)",
     main = "Predicted vs actual price in dollars for test set",
     col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")
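
To complement the plot with a single number, we can also compute the RMSE of the test-set predictions in dollars (a quick sketch; the numeric result is omitted here):

# Test-set RMSE in dollars
sqrt(mean((test_hd$SalePrice - predicted) ^ 2))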

Neighborhood Effect Analysis

Methods

For this second analysis, we again use the housing data loaded from AmesHousing.csv. We will also define a convenience function splitDataset that will randomly split the data into two data sets: “train” and “test”.

# createDataPartition() comes from the caret package
library(caret)

splitDataset = function(data) {
  training = createDataPartition(data$SalePrice, times = 1, p = 0.80, list = FALSE)
  return(list(train = data[training, ], test = data[-training, ]))
}


splits = splitDataset(housing_data)
training_data = splits$train
test_data = splits$test

Now we can begin our exploratory data analysis. To examine the relationship between Neighborhood and SalePrice, we will perform a simple one-way ANOVA to determine whether belonging to different neighborhoods has any effect on SalePrice at all.

neighborhood_oneway_anova = aov(SalePrice ~ Neighborhood, data=housing_data)
neigh_oneway_anova_summary = summary(neighborhood_oneway_anova)

A simple linear model would allow us to infer whether the Ground Living Area (Gr.Liv.Area) of a house has any relationship with SalePrice.

livarea_model = lm(SalePrice ~ `Gr Liv Area`, data=housing_data)
livarea_model_summary = summary(livarea_model)

And, in order to infer whether a house’s living area is more expensive in some neighborhoods than in others, we’ll perform an Analysis of Covariance (ANCOVA) by including an interaction term between the continuous variable Gr.Liv.Area and the categorical variable Neighborhood. If the slopes of Gr.Liv.Area differ (are not parallel) across neighborhoods, we should observe a strong interaction between the two predictors.

# Workaround: copy the column to a syntactically valid name so it can be used
# directly in formulas and by the slope/interaction plotting functions below
housing_data_alias = housing_data
housing_data_alias$Gr.Liv.Area = housing_data$`Gr Liv Area`

neigh_livarea_interaction_model = lm(SalePrice ~ Gr.Liv.Area * Neighborhood, data = housing_data_alias)
neigh_livarea_anova = anova(neigh_livarea_interaction_model)
p_value_neigh_livarea_interact = neigh_livarea_anova[3, 5]

We will also estimate actual slopes of the lines of SalePrice vs. Gr.Liv.Area for different neighborhoods to examine any differences.

# sim_slopes() comes from the interactions package (formerly part of jtools)
library(interactions)
livarea_slopes_raw = sim_slopes(neigh_livarea_interaction_model, pred = Gr.Liv.Area, modx = Neighborhood,
                                johnson_neyman = FALSE, centered = "none", confint = TRUE)

livarea_slopes = data.frame(livarea_slopes_raw$slopes[,1:5],stringsAsFactors=FALSE)
livarea_slopes[,2:5] = sapply(livarea_slopes[,2:5], as.double)
livarea_slopes[,2:5] = sapply(livarea_slopes[,2:5], signif, digits=4)

colnames(livarea_slopes)[1] = "Neighborhood"
colnames(livarea_slopes)[3] = "Std. Err."
colnames(livarea_slopes)[4] = "2.5%"
colnames(livarea_slopes)[5] = "97.5%"

Lastly, in order to determine whether the effect of Gr.Liv.Area on SalePrice is explained away by other predictors, we’ll calculate the partial correlation coefficient between them with the effect of all other predictors removed; that is, we regress both SalePrice and Gr.Liv.Area on the remaining predictors and correlate the two residual vectors.

Results

Neighborhood Effect on Living Area Price

From our first ANOVA of Neighborhood and SalePrice, it can easily be seen that the neighborhood has a significant effect on the price. We can also see this graphically using a box plot.

neigh_oneway_anova_summary
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## Neighborhood   25 9.156e+12 3.662e+11   136.3 <2e-16 ***
## Residuals    2645 7.105e+12 2.686e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(SalePrice ~ Neighborhood, data = housing_data, las = 2, col = 2:8, pch = 20, cex = .7,
        main = "Sale Prices by Neighborhood",
        ylab = "Sale Price (US Dollars)")

Similarly, it is easily shown that Gr.Liv.Area has an effect on SalePrice.

plot(SalePrice ~ `Gr Liv Area`, data = housing_data,
     xlab = "Living Area (square feet)",
     ylab = "Sale Price (US Dollars)",
     main = "Sale Price vs Ground Living Area",
     pch  = 20,
     cex  = .5,
     col  = "grey")
abline(livarea_model, col="darkorange", lwd=2)

However, we also observe that the interaction between Neighborhood and Gr.Liv.Area yielded a near-zero p-value of \(9.872 \times 10^{-75}\). This suggests a significant interaction between the two predictors when predicting SalePrice.

signif(p_value_neigh_livarea_interact,4)
## [1] 9.872e-75
print(neigh_livarea_anova[-4,], signif.stars=FALSE)
## Analysis of Variance Table
## 
## Response: SalePrice
##                          Df     Sum Sq    Mean Sq  F value    Pr(>F)
## Gr.Liv.Area               1 8.7037e+12 8.7037e+12 6754.782 < 2.2e-16
## Neighborhood             25 3.5875e+12 1.4350e+11  111.370 < 2.2e-16
## Gr.Liv.Area:Neighborhood 25 5.9450e+11 2.3780e+10   18.455 < 2.2e-16

Because of this, we obtain different slopes for Gr.Liv.Area depending on Neighborhood. Each slope estimate is the sum of the Gr.Liv.Area coefficient and its interaction term with the corresponding Neighborhood dummy variable. As with all estimates, we can compute a point estimate and a confidence interval.
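
For instance, a single neighborhood’s slope can be reconstructed directly from the fitted coefficients. This is a minimal sketch assuming R’s default treatment coding and coefficient naming, with NoRidge chosen purely for illustration:

# Slope for one neighborhood = base Gr.Liv.Area coefficient
# plus that neighborhood's interaction coefficient
b = coef(neigh_livarea_interaction_model)
b["Gr.Liv.Area"] + b["Gr.Liv.Area:NeighborhoodNoRidge"]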

## Get the 5 neighborhoods with the most houses
ordered_neigh = order(-table(housing_data$Neighborhood))

# interact_plot() comes from the interactions package; ggtitle()/theme() from ggplot2
library(ggplot2)
interact_plot(neigh_livarea_interaction_model, pred = Gr.Liv.Area,
              modx = Neighborhood, color.class = "Qual1",
              modxvals = names(table(housing_data$Neighborhood)[ordered_neigh][1:5]),
              x.label = "Ground Living Area (square feet)",
              y.label = "Sale Price (US Dollars)") +
  ggtitle("Gr.Liv.Area Slopes of the 5 Neighborhoods with Most Sales") +
  theme(plot.title = element_text(hjust = 0.5))

Figure X: Estimated SalePrice vs. Gr.Liv.Area regression lines for the 5 neighborhoods with the most sales.

kable(livarea_slopes)
Neighborhood      Est.   Std. Err.       2.5%     97.5%
Blmngtn         130.50      45.260    41.7000    219.20
Blueste         -14.07      71.290  -153.9000    125.70
BrDale           58.43      46.600   -32.9500    149.80
BrkSide          72.45      11.330    50.2400     94.67
ClearCr          56.54      14.830    27.4500     85.63
CollgCr         106.10       5.434    95.4200    116.70
Crawfor          94.77       7.872    79.3400    110.20
Edwards          89.47       8.154    73.4800    105.50
Gilbert          77.71       9.873    58.3500     97.07
Greens          100.00      81.290   -59.3600    259.50
IDOTRR           55.44      10.810    34.2400     76.64
MeadowV          38.62      21.180    -2.9100     80.15
Mitchel          67.29       9.513    48.6300     85.94
NAmes            54.61       4.540    45.7100     63.51
NoRidge         140.00      11.060   118.4000    161.70
NPkVill          25.16      30.690   -35.0200     85.34
NridgHt         163.70       6.299   151.3000    176.00
NWAmes           67.46       7.250    53.2400     81.67
OldTown          60.69       4.769    51.3400     70.04
Sawyer           38.81       8.763    21.6300     55.99
SawyerW          84.56       7.220    70.4000     98.71
Somerst         123.10       8.449   106.5000    139.70
StoneBr         160.30       8.595   143.5000    177.20
SWISU            40.64      11.330    18.4200     62.86
Timber          105.70      12.060    82.0700    129.40
Veenker          27.89      14.470    -0.4871     56.26

Table X: Estimated Gr.Liv.Area slopes (price per square foot of living area, in US dollars) for each neighborhood, with standard errors and 95% confidence intervals.

Discussion

Relationship between Neighborhood, Sale Price and Other Predictors

Having observed how both Neighborhood and Gr.Liv.Area interact with each other and influence SalePrice, it is important to highlight that this does not imply a causal relationship between them. This becomes more evident when we examine the partial correlation coefficient of Gr.Liv.Area and SalePrice with the effect of all other 78 predictors removed: its magnitude is less than 0.01.

without_livarea       = lm(SalePrice ~ . -`Gr Liv Area` , data=housing_data)
livarea_collinearity  = lm(`Gr Liv Area`~ . -SalePrice, data=housing_data)
(partial_corr_livarea_saleprice = cor(resid(without_livarea),resid(livarea_collinearity)))
## [1] 0.003903368

Therefore, ground living area is collinear with many of the other predictors available in the dataset, and does not independently predict price. We can, however, reject the null hypothesis that the average price per square foot of living space is equal for all neighborhoods at the \(\alpha = 0.01\) significance level.

References

  1. De Cock, Dean. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education.
  2. House Prices: Advanced Regression Techniques. Obtained on July 9th, 2018, from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
  3. Dalpiaz, D. (2018). Applied Statistics with R. Obtained on July 9th, 2018, from: https://daviddalpiaz.github.io/appliedstats/
  4. De Cock, Dean. (2011). Ames, Iowa house price dataset. Obtained on July 9th, 2018, from: http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls
  5. De Cock, Dean. (2011). Ames, Iowa house price dataset description. Obtained on July 9th, 2018, from: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt